ABSTRACT

There exist a variety of star-galaxy classification techniques, each with its own strengths and
weaknesses. In this paper, we present a novel meta-classification framework that combines and
fully exploits different techniques to produce a more robust star-galaxy classification. To
demonstrate this hybrid, ensemble approach, we combine a purely morphological classifier, a
supervised machine learning method based on random forest, an unsupervised machine learning
method based on self-organizing maps, and a hierarchical Bayesian template fitting method. Using
data from the CFHTLenS survey, we consider different scenarios: when a high-quality training set
is available with spectroscopic labels from DEEP2, SDSS, VIPERS, and VVDS, and when the
demographics of sources in a low-quality training set do not match the demographics of objects in
the test data set. We demonstrate that our Bayesian combination technique improves the overall
performance over any individual classification method in these scenarios. Thus, strategies that
combine the predictions of different classifiers may prove to be optimal in currently ongoing and
forthcoming photometric surveys, such as the Dark Energy Survey and the Large Synoptic Survey
Telescope.

Key words: methods: data analysis - methods: statistical - surveys - stars: statistics -
galaxies: statistics.


1 INTRODUCTION 

The problem of source classification is fundamental to astronomy and
goes as far back as Messier (1781). A variety of different strategies
have been developed to tackle this long-standing problem, and yet
there is no consensus on the optimal star-galaxy classification
strategy. The most commonly used method to classify stars and galaxies
in large sky surveys is morphological separation (Sebok 1979; Kron
1980; Valdes 1982; Yee 1991; Vasconcellos et al. 2011; Henrion et al.
2011). It relies on the assumption that stars appear as point sources
while galaxies appear as resolved sources. However, currently ongoing
and upcoming large photometric surveys, such as the Dark Energy
Survey (DES) and the Large Synoptic Survey Telescope (LSST),
will detect a vast number of unresolved galaxies at faint magnitudes.
Near a survey's limit, the photometric observations cannot reliably
separate stars from unresolved galaxies by morphology alone without
leading to incompleteness and contamination in the star and galaxy
samples.

Contamination by unresolved galaxies can be mitigated by
using training-based algorithms. Machine learning methods have the
advantage that it is easier to include extra information, such as
concentration indices, shape information, or different model magnitudes.
However, they are only reliable within the limits of the training data,
and it can be difficult to extrapolate these algorithms outside the
parameter range of the training data. These techniques can be further
categorized into supervised and unsupervised learning approaches.

In supervised learning, the input attributes (e.g., magnitudes or
colors) are provided along with the truth labels (e.g., star or galaxy).
Odewahn et al. (1992) pioneered the application of neural networks to
the star-galaxy classification problem, and it has become a core part
of the astronomical image processing software SExtractor (Bertin
& Arnouts 1996). Other successfully implemented examples include
decision trees (Weir et al. 1995; Suchkov et al. 2005; Ball et al.
2006; Sevilla-Noarbe & Etayo-Sotos 2015) and Support Vector
Machines (Fadely, Hogg & Willman 2012). Unsupervised machine learning
techniques are less common, as they do not utilize the truth labels
during the training process, and only the input attributes are used.

Physically based template fitting methods have also been used
for the star-galaxy classification problem (Robin et al. 2007; Fadely
et al. 2012). Template fitting approaches infer a source's properties
by finding the best match between the measured set of magnitudes
(or colors) and the synthetic set of magnitudes (or colors) computed
from a set of spectral templates. Although it is not necessary to
obtain a high-quality spectroscopic training sample, these techniques do
require a representative sample of theoretical or empirical templates
that span the possible spectral energy distributions (SEDs) of stars
and galaxies. Furthermore, they are not exempt from uncertainties
due to measurement errors on the filter response curves, or from
mismatches between the observed magnitudes and the template SEDs.

In this paper, we present a novel star-galaxy classification
framework that combines and fully exploits different classification
techniques to produce a more robust classification. In particular, we show
that the combination of a morphological separation method, a
template fitting technique, a supervised machine learning method, and
an unsupervised machine learning algorithm can improve the overall
performance over any individual method. In Section 2, we describe
each of the star-galaxy classification methods. In Section 3, we
describe different classification combination techniques. In Section 4,
we describe the Canada-France-Hawaii Telescope Lensing Survey
(CFHTLenS) data set with which we test the algorithms. In Section 5,
we compare the performance of our combination techniques to the
performance of the individual classification techniques. Finally, we
outline our conclusions in the final section.


2 CLASSIFICATION METHODS 

In this section, we present four distinct star-galaxy classification
techniques. The first method is a morphological separation method, which
uses a hard cut in the half-light radius vs. magnitude plane. The
second method is a supervised machine learning technique named TPC
(Trees for Probabilistic Classification), which uses prediction trees
and a random forest (Carrasco Kind & Brunner 2013). The third
method is an unsupervised machine learning technique named SOMc,
which uses self-organizing maps (SOMs) and a random atlas to
provide a classification (Carrasco Kind & Brunner 2014b). The fourth
method is a Hierarchical Bayesian (HB) template fitting technique
based on the work by Fadely et al. (2012), which fits SED templates
from star and galaxy libraries to an observed set of measured flux
values.

Collectively, these four methods represent the majority of all
standard star-galaxy classification approaches published in the
literature. It is very likely that any new classification technique would be
functionally similar to one of these four methods. Therefore, any of
these four methods could in principle be replaced by a similar method.


2.1 Morphological Separation 

The simplest and perhaps the most widely used approach to
star-galaxy classification is to make a hard cut in the space of photometric
attributes. As a first-order morphological selection of point sources,
we adopt a technique that is popular among the weak lensing
community (Kaiser, Squires & Broadhurst 1995). As Figure 1 shows,
there is a distinct locus produced by point sources in the half-light
radius (estimated by SExtractor's FLUX_RADIUS parameter) vs.
the i-band magnitude plane. A rectangular cut in this size-magnitude
plane separates point sources, which are presumed to be stars, from
resolved sources, which are presumed to be galaxies. The boundaries
of the selection box are determined by manually inspecting the
size-magnitude diagram.

One of the disadvantages of such cut-based methods is that they
classify every source with absolute certainty. It is difficult to
justify such a decisive classification near a survey's magnitude limits,
where measurement uncertainties generally increase. A more
informative approach is to provide probabilistic classifications. Although
a recent work by Henrion et al. (2011) implemented a probabilistic
classification using a Bayesian approach on the morphological
measurements alone, here we use a cut-based morphological separation
to demonstrate the advantages of our combination techniques. In
particular, we later show that the binary outputs (i.e., 0 or 1) of cut-based
methods can be transformed into probability estimates by combining
them with the probability outputs from other probabilistic
classification techniques, such as TPC, SOMc, and HB.
Figure 1. Half-light radius vs. magnitude.
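
As a minimal sketch of the rectangular cut described in this subsection, the following Python function returns the binary output (1 for a presumed star, 0 for a presumed galaxy) for each source. The function name, the box boundaries, and the sample values are hypothetical and chosen only for illustration; in the text the boundaries are set by manual inspection of the size-magnitude diagram.

```python
def morphological_cut(flux_radius, mag, box):
    """Binary star-galaxy cut in the size-magnitude plane.

    Returns 1 (presumed star) for sources inside the selection box
    and 0 (presumed galaxy) otherwise.
    """
    r_min, r_max, mag_min, mag_max = box
    return [
        1 if (r_min <= r <= r_max and mag_min <= m <= mag_max) else 0
        for r, m in zip(flux_radius, mag)
    ]


# Hypothetical box: (min radius, max radius, bright limit, faint limit).
STAR_BOX = (1.7, 2.2, 17.0, 22.0)

# Three made-up sources: compact and bright, extended, compact but faint.
is_star = morphological_cut([1.9, 3.5, 2.0], [19.0, 21.0, 24.5], STAR_BOX)
print(is_star)  # [1, 0, 0]
```

Note that the third source is compact but falls outside the magnitude range of the box, so the cut labels it a galaxy with full certainty; this is the behaviour that the probabilistic combination methods are meant to soften.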


2.2 Supervised Machine Learning: TPC 

TPC is a parallel, supervised machine learning algorithm that uses
prediction trees and random forest techniques (Breiman et al. 1984;
Breiman 2001) to produce a star-galaxy classification. TPC is a part
of a publicly available software package called MLZ (Machine
Learning for Photo-z). The full software package includes: TPZ, a
supervised photometric redshift (photo-z) estimation technique
(regression mode; Carrasco Kind & Brunner 2013); TPC, a supervised
star-galaxy classification technique (classification mode); SOMz, an
unsupervised photo-z technique (regression mode; Carrasco Kind &
Brunner 2014b); and SOMc, an unsupervised star-galaxy
classification technique (classification mode).

TPC uses classification trees, a type of prediction tree
designed to provide a classification or predict a discrete category.
Prediction trees are built by asking a sequence of questions that
recursively split the data into branches until a terminal leaf is created that
meets a stopping criterion (e.g., a minimum leaf size). The optimal
split dimension is decided by choosing the attribute that maximizes
the Information Gain ($I_G$), which is defined as

$$
I_G(D_{\rm node}, X) = I_d(D_{\rm node})
  - \sum_{x \in {\rm values}(X)} \frac{|D_{{\rm node},x}|}{|D_{\rm node}|}
    I_d(D_{{\rm node},x}), \tag{1}
$$

where $D_{\rm node}$ is the training data in a given node, $X$ is one of the
possible dimensions (e.g., magnitudes or colors) along which the node is
split, and $x$ are the possible values of a specific dimension $X$.
$|D_{\rm node}|$ and $|D_{{\rm node},x}|$ are the size of the total training
data and the number of objects in a given subset $x$ within the current node,
respectively. $I_d$ is the impurity degree index, and TPC can calculate $I_d$
from any of three standard impurity indices: information entropy,
Gini impurity, and classification error. In this work, we use the
information entropy, which is defined similarly to the thermodynamic
entropy:

$$
I_d(D) = -f_g \log_2 f_g - (1 - f_g) \log_2 (1 - f_g), \tag{2}
$$

where $f_g$ is the fraction of galaxies in the training data. At each node
in our tree, we scan all dimensions to identify the split point that
maximizes the information gain as defined by Equation 1, and select the
attribute that gives the largest reduction in impurity overall.
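
The split criterion of Equations 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TPC implementation: labels are encoded as 1 for galaxies and 0 for stars, each node is split into two subsets by a threshold on a single attribute, and the function names are hypothetical.

```python
import math


def entropy(labels):
    """Information entropy (Equation 2) of binary labels (1 = galaxy)."""
    n = len(labels)
    if n == 0:
        return 0.0
    f_g = sum(labels) / n
    if f_g in (0.0, 1.0):
        return 0.0  # a pure node has zero impurity
    return -f_g * math.log2(f_g) - (1 - f_g) * math.log2(1 - f_g)


def information_gain(values, labels, threshold):
    """Equation 1 for a two-way split at the given threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_split(values, labels):
    """Scan midpoints between distinct values; return (threshold, gain).

    Assumes the attribute takes at least two distinct values.
    """
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Made-up magnitudes: three bright stars and three faint galaxies.
threshold, gain = best_split([18, 19, 20, 23, 24, 25], [0, 0, 0, 1, 1, 1])
print(threshold, gain)  # 21.5 1.0
```

In a full tree, `best_split` would be evaluated for every attribute at every node, and the attribute with the largest gain would define the split.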

In a technique called random forest, we create bootstrap samples
(i.e., N randomly selected objects with replacement) from the input
training data by sampling repeatedly from the magnitudes and
colors using their measurement errors. We use these bootstrap samples
to construct multiple, uncorrelated prediction trees whose individual
predictions are aggregated to produce a star-galaxy classification for
each source.

We also use a cross-validation technique called Out-of-Bag
(OOB; Breiman et al. 1984; Carrasco Kind & Brunner 2013).
When a tree (or a map) is built in TPC (or SOMc), a fraction of the
training data, usually one-third, is left out and not used in training
the trees (or maps). After a tree is constructed using two-thirds of the
training data, the final tree is applied to the remaining one-third to
make a classification. This process is repeated for every tree, and the
predictions from each tree are aggregated for each object to make the
final star-galaxy classification. We emphasize that if an object is used
for training a given tree, it is never used for subsequent prediction
by that tree. Thus, the OOB data provide an unbiased estimate of the
errors and can be used as cross-validation data as long as the OOB data
remain similar to the final test data set. The OOB technique can also
provide extra information such as a ranking of the relative importance
of the input attributes used in the prediction. The OOB technique can
prove extremely valuable when calibrating the algorithm, when
deciding which attributes to incorporate in the construction of the trees,
and when combining this approach with other techniques.
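
The OOB bookkeeping can be sketched as follows. This is a simplified illustration, not the TPC code: the base learner is a toy nearest-class-mean classifier on a single attribute standing in for a prediction tree, and all names and data are hypothetical. A bootstrap sample of size N drawn with replacement leaves out roughly one-third of the objects on average, which is the held-out fraction mentioned above.

```python
import random


def fit_class_means(xs, ys):
    """Toy stand-in for a prediction tree: per-class mean of one attribute."""
    means = {}
    for label in (0, 1):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members) if members else None
    return means


def predict(means, x):
    """Assign the class whose mean is closest to x."""
    known = {k: m for k, m in means.items() if m is not None}
    return min(known, key=lambda k: abs(x - known[k]))


def oob_probabilities(xs, ys, n_trees=50, seed=0):
    """Out-of-bag estimate of P(galaxy) for every training object."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[] for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        out_of_bag = set(range(n)) - set(in_bag)       # never seen by this learner
        model = fit_class_means([xs[i] for i in in_bag],
                                [ys[i] for i in in_bag])
        for i in out_of_bag:
            votes[i].append(predict(model, xs[i]))
    # Aggregate only the learners that did not train on the object.
    return [sum(v) / len(v) if v else None for v in votes]


# Made-up magnitudes: four stars (0) and four galaxies (1).
mags = [18.0, 18.5, 19.0, 19.5, 23.0, 23.5, 24.0, 24.5]
truth = [0, 0, 0, 0, 1, 1, 1, 1]
oob = oob_probabilities(mags, truth, n_trees=200, seed=1)
```

Because each object is scored only by learners that never saw it, the resulting probabilities can serve as cross-validation data for the combination methods.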


2.3 Unsupervised Machine Learning: SOMc 

A self-organizing map (Kohonen 1990, 2001) is an unsupervised,
artificial neural network algorithm that is capable of projecting
high-dimensional input data onto a low-dimensional map through a
process of competitive learning. In astronomical applications, the
high-dimensional input data can be magnitudes, colors, or some other
photometric attributes. The output map is usually chosen to be
two-dimensional so that the resulting map can be used for visualizing
various properties of the input data. The differences between a SOM and
typical neural network algorithms are that a SOM is unsupervised,
there are no hidden layers and therefore no extra parameters, and it
produces a direct mapping between the training set and the output
network. In fact, a SOM can be viewed as a non-linear generalization
of a principal component analysis (PCA) algorithm (Yin 2008).

The key characteristic of a SOM is that it retains the topology
of the input training set, revealing correlations between input data
that are not obvious. The method is unsupervised: the user is not
required to specify the desired output during the creation of the
lower-dimensional map, and the mapping of the components from the input
vectors is a natural outcome of the competitive learning process.

During the construction of a SOM, each node on the
two-dimensional map is represented by a weight vector of the same
dimension as the number of attributes used to create the map itself. In
an iterative process, each object in the input sample is individually
used to correct these weight vectors. This correction is determined
so that the specific neuron (or node), which at a given moment best
represents the input source, is modified along with the weight
vectors of that node's neighboring neurons. As a result, this sector within
the map becomes a better representation of the current input object.
This process is repeated for every object in the training data, and the
entire process is repeated for several iterations. Eventually, the SOM
converges to its final form where the training data are separated into
groups of similar features. Although the spectroscopic labels are not
used at all in the learning process, they are used (only after the map
has been constructed) to generate predictions for each cell in the
resulting two-dimensional map.
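
The competitive-learning loop and the subsequent labelling of cells can be sketched as below. This is a generic textbook SOM on a small rectangular grid, written under assumptions that are not taken from the text: a Gaussian neighbourhood function, linearly decaying learning rate and neighbourhood width, attributes rescaled to the unit interval, and hypothetical function names and parameter values.

```python
import math
import random


def best_matching_unit(weights, x):
    """Node whose weight vector is closest (Euclidean) to the input x."""
    return min(weights, key=lambda node: sum(
        (w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(data, rows=3, cols=3, n_iter=20, alpha0=0.5, sigma0=1.5, seed=0):
    """Train a rectangular SOM; labels are never used in this step."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total, step = n_iter * len(data), 0
    for _ in range(n_iter):
        for x in rng.sample(data, len(data)):       # random order each pass
            frac = step / total
            alpha = alpha0 * (1 - frac)             # learning rate decays
            sigma = max(sigma0 * (1 - frac), 0.1)   # neighbourhood shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for k in range(dim):
                    w[k] += alpha * h * (x[k] - w[k])  # pull towards x
            step += 1
    return weights


def label_cells(weights, data, labels):
    """After training, use the labels to get a galaxy fraction per cell."""
    members = {}
    for x, y in zip(data, labels):
        members.setdefault(best_matching_unit(weights, x), []).append(y)
    return {node: sum(ys) / len(ys) for node, ys in members.items()}


# Made-up, rescaled two-attribute data: three stars (0), three galaxies (1).
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
labels = [0, 0, 0, 1, 1, 1]
som = train_som(data)
cell_fraction = label_cells(som, data, labels)
```

A test object would then be assigned the galaxy fraction of the cell containing its best-matching unit, which is how the labels enter only after the map has been built.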

In a similar approach to random forest in TPZ and TPC, SOMz
uses a technique called random atlas to provide photo-z estimation
(Carrasco Kind & Brunner 2014b). In random atlas, the prediction
trees of random forest are replaced by maps, and each map is
constructed from different bootstrap samples of the training data.
Furthermore, we create random realizations of the training data by
perturbing the magnitudes and colors by their measurement errors. For
each map, we can either use all available attributes, or randomly
select a subsample of the attribute space. This SOM implementation
can also be applied to the classification problem, and we refer to it as
SOMc in order to differentiate it from the photo-z estimation problem
(regression mode). We also use the random atlas approach in some of
the classification combination approaches as discussed in Section 3.

One of the most important parameters in SOMc is the topology
of the two-dimensional SOM, which can be rectangular, hexagonal,
or spherical. In our SOM implementation, it is also possible to use
periodic boundary conditions for the non-spherical cases. The
spherical topology is by definition periodic and is constructed by using
HEALPix (Górski et al. 2005). Similar to TPC, we use the OOB
technique to make an unbiased estimation of errors. We determine
the optimal parameters by performing a grid search in the parameter
space of different topologies, as well as other SOM parameters, for the
OOB data. We find that the spherical topology gives the best
performance for the CFHTLenS data, likely due to its natural periodicity.
Thus, we use a spherical topology to classify stars and galaxies in the
CFHTLenS data. For a complete description of the SOM
implementation and its application to the estimation of photo-z probability
density functions (photo-z PDFs), we refer the reader to Carrasco Kind
& Brunner (2014b).


2.4 Template fitting: Hierarchical Bayesian 

One of the most common methods to classify a source based on its
observed magnitudes is template fitting. Template fitting algorithms
do not require a spectroscopic training sample; there is no need for
additional knowledge outside the observed data and the template SEDs.
However, any incompleteness in our knowledge of the template SEDs
that fully span the possible SEDs of observed sources may lead to
misclassification of sources.

Bayesian algorithms use Bayesian inference to quantify the
relative probability that each template matches the input photometry and
determine a probability estimate by computing the posterior that a
source is a star or a galaxy. In this work, we have modified and
parallelized a publicly available Hierarchical Bayesian (HB) template
fitting algorithm by Fadely et al. (2012). In this section, we provide a
brief description of the HB template fitting technique; for the details
of the underlying HB approach, we refer the reader to Fadely et al.
(2012).

We write the posterior probability that a source is a star as

$$
P(S|x, \theta) \propto P(x|S, \theta) P(S|\theta), \tag{3}
$$

where $x$ represents a given set of observed magnitudes. We have also
introduced the hyperparameter $\theta$, a nuisance parameter that
characterizes our uncertainty in the prior distribution. To compute the
likelihood that a source is a star, we marginalize over all star and galaxy
templates $T$. In a template-fitting approach, we marginalize by
summing up the likelihood that a source has the set of magnitudes $x$ for
a given star template as well as the likelihood for a given galaxy
template:

$$
P(x|S, \theta) = \sum_{t \in T} P(x|S, t, \theta) P(t|S, \theta). \tag{4}
$$

The likelihood of each template, $P(x|S, t, \theta)$, is itself marginalized over
the uncertainty in the template-fitting coefficient. Furthermore, for
galaxy templates, we introduce another step that marginalizes the
likelihood by redshifting a given galaxy template by a factor of 1 + z.

Marginalization in Equation 4 requires that we specify the prior
probability $P(t|S, \theta)$ that a source has a spectral template $t$ (at a
given redshift). Thus, the probability that a source is a star (or a
galaxy) is either the posterior probability itself if a prior is used, or the
likelihood itself if an uninformative prior is used. In a Bayesian
analysis, it is preferable to use a prior, which can be directly computed
either from physical assumptions, or from an empirical function
calibrated by using a spectroscopic training sample. In an HB approach,
the entire sample of sources is used to infer the prior probabilities for
each individual source.

Since the templates are discrete in both SED shape and physical
properties, we parametrize the prior probability of each template as a
discrete set of weights such that

$$
\sum_{t \in T} P(t|S, \theta) = 1. \tag{5}
$$

Similarly, we also parametrize the overall prior probability, $P(S|\theta)$,
in Equation 3 as a weight. These weights correspond to the
hyperparameters, which can be inferred by sampling the posterior
probability distribution in the hyperparameter space. For the sampling, we
use emcee, a Python implementation of the affine-invariant Markov
Chain Monte Carlo (MCMC) ensemble sampler (Foreman-Mackey
et al. 2013).
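
For fixed hyperparameters, Equations 3 to 5 reduce to a weighted sum over templates followed by a normalization over the two classes. The sketch below illustrates only that arithmetic; it assumes that the per-template log-likelihoods have already been computed, that the only alternatives are star and galaxy (so the proportionality in Equation 3 can be normalized over both), and that the weights are given rather than inferred by MCMC. All names and numbers are hypothetical.

```python
import math


def class_likelihood(template_loglikes, template_weights):
    """Equation 4: sum over templates of P(x|S,t,theta) * P(t|S,theta)."""
    # Equation 5: the template weights of a class must sum to one.
    assert abs(sum(template_weights) - 1.0) < 1e-9
    return sum(w * math.exp(ll)
               for ll, w in zip(template_loglikes, template_weights))


def star_posterior(star_loglikes, star_weights,
                   gal_loglikes, gal_weights, prior_star):
    """Equation 3, normalized over the star and galaxy hypotheses."""
    like_star = class_likelihood(star_loglikes, star_weights)
    like_gal = class_likelihood(gal_loglikes, gal_weights)
    numerator = like_star * prior_star
    return numerator / (numerator + like_gal * (1.0 - prior_star))


# Made-up example: two stellar templates, one galaxy template.
p_star = star_posterior(
    star_loglikes=[math.log(0.2), math.log(0.4)], star_weights=[0.5, 0.5],
    gal_loglikes=[math.log(0.1)], gal_weights=[1.0],
    prior_star=0.5,
)
print(round(p_star, 6))  # 0.75
```

In the hierarchical approach described above, the template weights and the overall prior would not be fixed by hand as they are here; they are the hyperparameters sampled from the full data set.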

As the goal of template fitting methods is to minimize the
difference between observed and theoretical magnitudes, this approach
heavily relies on both the use of SED templates and the accuracy
of the transmission functions for the filters used for a particular
survey. For our stellar templates, we use the empirical SED library from
Pickles (1998). The Pickles library consists of 131 stellar templates,
which span all normal spectral types and luminosity classes at solar
abundance, as well as metal-poor and metal-rich F-K dwarf and G-K
giant and supergiant stars. We supplement the stellar library with 100
SEDs from Chabrier et al. (2000), which include low mass stars and
brown dwarfs with different $T_{\rm eff}$ and surface gravities. We also
include four white dwarf templates of Bohlin, Colina & Finley (1995),
for a total of 235 templates in our final stellar library. For our galaxy
templates, we use four CWW spectra from Coleman, Wu & Weedman
(1980), which include an Elliptical, an Sba, an Sbb, and an Irregular
galaxy template. When extending an analysis to higher redshifts, the
CWW library is often augmented with two starburst galaxy
templates from Kinney et al. (1996). From the six original CWW and
Kinney spectra, intermediate templates are created by interpolation,
for a total of 51 SEDs in our final galaxy library.

All of the above templates are convolved with the filter response
curves to generate model magnitudes. These response curves consist
of u, g, r, i, z filter transmission functions for the observations taken
by the Canada-France-Hawaii Telescope (CFHT).


3 CLASSIFICATION COMBINATION METHODS 

Building on the work in the field of ensemble learning, we combine
the predictions from individual star-galaxy classification techniques
using four combination techniques. The main idea behind ensemble
learning is to weight the predictions from individual models and
combine them to obtain a prediction that outperforms every one of them
individually (Rokach 2010).


3.1 Unsupervised Binning 

Given the variety of star-galaxy classification methods we are using,
we fully expect the relative performance of the individual techniques
to vary across the parameter space spanned by the data. For example,
it is reasonable to expect supervised techniques to outperform other
techniques in areas of parameter space that are well-populated with
training data. Similarly, we can expect unsupervised approaches such
as SOM or template fitting approaches to generally perform better
when a training sample is either sparse or unavailable.

We therefore adopt a binning strategy similar to Carrasco Kind
& Brunner (2014a). In this binning strategy, we allow different
classifier combinations in different parts of parameter space by creating
two-dimensional SOM representations of the full nine-dimensional
magnitude-color space: u, g, r, i, z, u - g, g - r, r - i, and i - z. A
SOM representation can be rectangular, hexagonal, or spherical; here
we choose a 10 x 10 rectangular topology to facilitate visualization, as
shown in the accompanying figure. We note that this choice is mainly for
convenience and that the optimal topology and map size would likely depend
on a number of factors, such as the number of objects and attributes. For all
combination methods, we use only the OOB (cross-validation) data
contained in each cell to compute the relative weights for the base
classifiers. The weights within individual cells are then applied to the
blind test data set to make the prediction.

Furthermore, we construct a collection of SOM representations
and subsequently combine the predictions from each map into a
meta-prediction. Given a training sample of $N$ sources, we generate $N_r$
random realizations of training data by perturbing the attributes with
the measured uncertainty for each attribute. The uncertainties are
assumed to be normally distributed. In this manner, we reduce the bias
towards the data and introduce randomness in a systematic manner.
For each random realization of a training sample, we create $N_m$
bootstrap samples of size $N$ to generate $N_m$ different maps.

After all maps are built, we have a total of $N_r \times N_m$
probabilistic outputs for each of the $N$ sources. To produce a single probability
estimate for each source, we could take the mean, the median, or some
other simple statistic. With a sufficient number of maps, we find that
there is usually negligible difference between taking the mean and
taking the median, and use the median in the following sections. We
note that it is also possible to establish confidence intervals using the
distribution of the probability estimates.
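
A minimal sketch of this aggregation step is given below, assuming that the outputs of all maps are stored as one list of per-source probabilities per map. The function names and the choice of a 5th to 95th percentile interval are illustrative assumptions, not values taken from the text.

```python
import statistics


def combine_maps(prob_by_map):
    """Median over maps of the probability assigned to each source.

    prob_by_map[k][i] is the probability that map k assigns to source i.
    """
    per_source = zip(*prob_by_map)  # transpose: one tuple per source
    return [statistics.median(p) for p in per_source]


def percentile_interval(probs, lower=5, upper=95):
    """Empirical interval from the spread of the per-map estimates."""
    cuts = statistics.quantiles(probs, n=100, method="inclusive")
    return cuts[lower - 1], cuts[upper - 1]


# Made-up outputs of three maps for two sources.
outputs = [[0.1, 0.9],
           [0.2, 0.8],
           [0.3, 0.7]]
combined = combine_maps(outputs)
print(combined)  # [0.2, 0.8]
interval = percentile_interval([row[0] for row in outputs])
```

The `inclusive` method is used so that the interval never extends beyond the smallest and largest per-map estimates, which keeps it inside the range of valid probabilities.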

3.2 Weighted Average 

The simplest approach to combine different classification techniques
is to simply add the individual classifications from the base classifiers
and renormalize the sum. In this case, the final probability is given by

$$
P(S|x, M) = \sum_i P(S|x, M_i), \tag{6}
$$

where $M$ is the set of models (TPC, SOMc, HB, and morphological
separation in our work). We improve on this simple approach by
using the binning strategy to calculate the weighted average of
objects in each SOM cell separately for each map, and then combine the
predictions from each map into a final prediction.

3.3 Bucket of Models (BoM) 

After the multi-dimensional input data have been binned, we can use
the cross-validation data to choose the best model within each bin,
and use only that model within that specific bin to make predictions
for the test data. We use the mean squared error (MSE; also known
as the Brier score; Brier 1950) as a classification error metric. We define
MSE as

$$
{\rm MSE} = \frac{1}{N} \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2, \tag{7}
$$

where $y_i$ is the actual truth value (e.g., 0 or 1) of the $i$th datum, and
$\hat{y}_i$ is the probability prediction made by the models. Thus, the model
with the minimum MSE is chosen in each bin and is assigned a weight
of one, while all other models are assigned a weight of zero. However, the
chosen model is allowed to vary between different bins.
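
The selection rule can be sketched as follows, assuming that each cross-validation object has already been assigned to a bin (for example, a SOM cell index) and that every base classifier provides a probability for it. The data layout, function names, and numbers are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between truth labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def bucket_of_models(cells, y_true, model_probs):
    """Pick, for each bin, the base classifier with the lowest MSE.

    cells[i]              bin (e.g., SOM cell) of cross-validation object i
    y_true[i]             truth label of object i (0 or 1)
    model_probs[name][i]  probability predicted by classifier `name`
    """
    best = {}
    for cell in set(cells):
        idx = [i for i, c in enumerate(cells) if c == cell]
        scores = {
            name: brier_score([y_true[i] for i in idx],
                              [probs[i] for i in idx])
            for name, probs in model_probs.items()
        }
        best[cell] = min(scores, key=scores.get)
    return best


# Made-up cross-validation data: two bins, two base classifiers.
cells = [0, 0, 1, 1]
truth = [1, 0, 1, 0]
probs = {
    "TPC": [0.9, 0.1, 0.6, 0.4],
    "HB": [0.6, 0.4, 0.95, 0.05],
}
chosen = bucket_of_models(cells, truth, probs)
print(chosen)  # {0: 'TPC', 1: 'HB'}
```

For a test object, the prediction is then simply the output of the classifier chosen for the bin into which that object falls.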


3.4 Stacking 

Instead of selecting a single model that performs best within each bin,
we can train a learning algorithm to combine the output values of
several other base classifiers in each bin. An ensemble learning method
of using a meta-classifier to combine lower-level classifiers is known
as stacking or stacked generalization (Wolpert 1992). Although any
arbitrary algorithm can theoretically be used as a meta-classifier, a
logistic regression or a linear regression is often used in practice. In
our work, we use a single-layer multi-response linear regression
algorithm, which often shows the best performance (Breiman 1996; Ting
& Witten 1999). This algorithm is a variant of the least-squares
regression algorithm, where a linear regression model is constructed
for each class.
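
A minimal sketch of multi-response linear regression used as a meta-classifier is given below. It fits one ordinary least-squares model per class on indicator targets, using the base classifiers' probabilities as features, and assigns the class with the largest fitted response. The normal equations are solved with a small hand-written elimination routine only to keep the example free of external dependencies; the function names and data are hypothetical, and the sketch assumes the feature columns are not collinear.

```python
def solve(a, b):
    """Solve the linear system a w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(features, targets):
    """Least-squares coefficients (intercept first) via normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def fit_stack(base_probs, labels, classes=(0, 1)):
    """One linear model per class, trained on 0/1 indicator targets."""
    return {c: fit_linear(base_probs, [1.0 if y == c else 0.0 for y in labels])
            for c in classes}


def predict_stack(models, probs):
    """Class with the largest fitted response for one object."""
    row = [1.0] + list(probs)
    scores = {c: sum(w * v for w, v in zip(ws, row)) for c, ws in models.items()}
    return max(scores, key=scores.get)


# Made-up P(galaxy) from two base classifiers for six training objects.
train = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.3], [0.1, 0.4], [0.3, 0.1]]
labels = [1, 1, 1, 0, 0, 0]
stack = fit_stack(train, labels)
```

In the binned scheme of Section 3.1, one such set of models would be fitted per SOM cell, using only the cross-validation objects that fall in that cell.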


3.5 Bayesian Model Combination

We also use a model combination technique known as Bayesian
Model Combination (BMC; Monteith et al. 2011), which uses
Bayesian principles to generate an ensemble combination of
different classifiers. The posterior probability that a source is a star is given
by

$$
P(S|x, D, M, E) = \sum_{e \in E} P(S|x, M, e) P(e|D), \tag{8}
$$

where $D$ is the data set, and $e$ is an element in the ensemble space $E$
of possible model combinations. By Bayes' Theorem, the posterior
probability of $e$ given $D$ is given by

$$
P(e|D) = \frac{P(e)}{P(D)} \prod_{d \in D} P(d|e)
  \propto P(e) \prod_{d \in D} P(d|e). \tag{9}
$$

Here, $P(e)$ is the prior probability of $e$, which we assume to be
uniform. The product of $P(d|e)$ is over all individual data $d$ in the
training data $D$, and $P(D)$ is merely a normalization factor and not
important.

For binary classifiers whose output is either zero or one (e.g., a
cut-based morphological separation), we assume that each example
is corrupted with an average error rate $\epsilon$. This means that
$P(d|e) = 1 - \epsilon$ if the combination $e$ correctly predicts class $y_i$
for the $i$th object, and $P(d|e) = \epsilon$ if it predicts an incorrect class.
The average rate $\epsilon$ can be estimated by the fraction $(M_g + M_s)/N$,
where $M_g$ is the number of true galaxies classified as stars, $M_s$ is the
number of true stars classified as galaxies, and $N$ is the total number of
sources. Equation 9 then becomes

$$
P(e|D) \propto P(e) (1 - \epsilon)^{N - M_g - M_s} \epsilon^{M_g + M_s}. \tag{10}
$$

For probabilistic classifiers, we can directly use the probabilistic
predictions and write Equation 9 as

$$
P(e|D) \propto P(e) \prod_{i=0}^{N-1}
  \left[ y_i \hat{y}_i + (1 - y_i)(1 - \hat{y}_i) \right]. \tag{11}
$$

Although the space $E$ of potential model combinations is in
principle infinite, we can produce a reasonable finite set of potential
model combinations by using sampling techniques. In our
implementation, the weights of each combination of the base classifiers are
obtained by sampling from a Dirichlet distribution. We first set all alpha
values of a Dirichlet distribution to unity. We then sample this
distribution $q$ times to obtain $q$ sets of weights. For each combination, we
assume a uniform prior and calculate $P(e|D)$ using Equation 10 or
11. We select the combination with the highest $P(e|D)$, and update
the alpha values by adding the weights of the most probable
combination to the current alpha values. The next $q$ sets of weights are drawn
using the updated alpha values.

We continue the sampling process until we reach a predefined
number of combinations, and finally use Equation 8 to compute the
posterior probability that a source is a star (or a galaxy). In this paper,
we use a $q$ value of three, and 1,000 model combinations are
considered.

We also use a binned version of the BMC technique, where we
use a SOM representation to apply different model combinations for
different regions of the parameter space. We note, however, that
introducing randomness through the construction of $N_r \times N_m$ different
SOM representations does not show significant improvement over
using only a single SOM representation. This similarity is likely due
to the randomness that has already been introduced by sampling from
the Dirichlet distribution. Thus, our BMC technique uses one SOM,
while the other combination techniques (WA, BoM, and stacking) generate
$N_r$ random realizations of $N_m$ maps.


4 DATA

We use photometric data from the Canada-France-Hawaii Telescope
Lensing Survey (CFHTLenS; Heymans et al. 2012; Erben et al.
2013; Hildebrandt et al. 2012). This catalog consists of more than
twenty-five million objects with a limiting magnitude of $i_{AB} \approx 25.5$.
It covers a total of 154 square degrees in the four fields (named W1,
W2, W3, and W4) of the CFHT Legacy Survey (CFHTLS; Gwyn 2012)
observed in the five photometric bands: u, g, r, i, and z.

We have cross-matched reliable spectroscopic galaxies from the
Deep Extragalactic Evolutionary Probe Phase 2 (DEEP2; Davis et al.
2003; Newman et al. 2013), the Sloan Digital Sky Survey Data
Release 10 (SDSS-DR10; Ahn et al. 2014), the Visible imaging
Multi-Object Spectrograph (VIMOS) Very Large Telescope (VLT) Deep
Survey (VVDS; Le Fèvre et al. 2005; Garilli et al. 2008), and the
VIMOS Public Extragalactic Redshift Survey (VIPERS; Garilli et al.
2014). We have selected only sources with very secure redshifts and
no bad flags (quality flags −1, 3, and 4 for DEEP2; quality flag 0 for
SDSS; quality flags 3, 4, 23, and 24 for VIPERS and VVDS). In the
end, we have 8,545 stars and 57,843 galaxies available for the training
and testing processes. We randomly select 13,278 objects for the blind
testing set, and use the remainder for training and cross-validation.
While HB uses only the magnitudes in the five bands, u, g, r, i, and
z, TPC and SOMc are trained with a total of 9 attributes: the five
magnitudes and their corresponding colors, u − g, g − r, r − i, and
i − z. The morphological separation method uses SExtractor's
FLUX_RADIUS parameter provided by the CFHTLenS catalog.
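The nine attributes can be assembled from the five magnitudes as sketched below. This is only an illustration: the function name and the dictionary layout are hypothetical, and treating −99 and 99 as bad or missing magnitudes is an assumption about how such values are flagged in the catalog.

```python
BANDS = ("u", "g", "r", "i", "z")
BAD_VALUES = (-99.0, 99.0)  # assumed sentinels for bad or missing magnitudes

def make_attributes(mags):
    """Return the five magnitudes followed by the four adjacent-band colors."""
    values = [mags[band] for band in BANDS]
    colors = []
    for blue, red in zip(BANDS[:-1], BANDS[1:]):
        if mags[blue] in BAD_VALUES or mags[red] in BAD_VALUES:
            colors.append(None)  # color is undefined when a band is missing
        else:
            colors.append(mags[blue] - mags[red])
    return values + colors

# Example object with made-up magnitudes.
row = make_attributes({"u": 23.0, "g": 22.0, "r": 21.5, "i": 21.0, "z": 20.75})
```

Keeping an explicit marker for undefined colors, rather than subtracting sentinel values, prevents a missing band from producing a spurious extreme color.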

Our goal here is not to obtain the best classifier performance;
for this we would have fine-tuned individual base classifiers and
chosen sophisticated models best suited to the particular properties of
the CFHTLenS data. For example, Hildebrandt et al. (2012) suggest that
all objects with i > 23 in the CFHTLenS data set may be classified
as galaxies without significant incompleteness and contamination in
the galaxy sample. Although this approach works because the high
Galactic latitude fields of the CFHTLS contain relatively few stars, it
is very unlikely that such an approach will meet the science
requirements for the quality of star-galaxy classification in
lower-latitude, star-crowded fields. Rather, our goal for the CFHTLenS
data set is to demonstrate the usefulness of combining different
classifiers even when the base classifiers may be poor or trained on
partial data. We also note that the relatively small number of stars in
the CFHTLS fields might paint too positive a picture of completeness
and purity, especially for the stars. Thus, we caution the reader that
the specific completeness and purity values will likely vary in other
surveys that observe large portions of the sky, and we emphasize once
again that our aim is to highlight that there is a relative improvement
in performance when we combine multiple star-galaxy classification
techniques to generate a meta-classification.

¹ http://www.cfhtlens.org/


5 RESULTS AND DISCUSSION 

In this section, we present the classification performance of the four 
different combination techniques, as well as the individual star-galaxy 
classification techniques on the CFHTLenS test data. 


5.1 Classification Metrics 

Probabilistic classification models can be considered as functions that
output a probability estimate of each source to be in one of the classes
(e.g., a star or a galaxy). Although the probability estimate can be
used as a weight in subsequent analyses to improve or enhance a
particular measurement (Ross et al. 2011), it can also be converted
into a class label by using a threshold (a probability cut). The
simplest way to choose the threshold is to set it to a fixed value,
e.g., p_cut = 0.5. This is, in fact, what is often done (e.g., Henrion
et al. 2011; Fadely et al. 2012). However, choosing 0.5 as a threshold
is not the best choice for an unbalanced data set, where galaxies
outnumber stars. Furthermore, setting a fixed threshold ignores the
operating condition (e.g., science requirements, stellar distribution,
misclassification costs) where the model will be applied.


5.1.1 Receiver Operating Characteristic Curve 

When we have no information about the operating condition, we can
still evaluate the performance of classifiers with tools such as the
Receiver Operating Characteristic (ROC) curve (Swets, Dawes &
Monahan 2000). An ROC curve is a graphical plot that illustrates
the true positive rate versus the false positive rate of a binary
classifier as its classification threshold is varied. The Area Under
the Curve (AUC) summarizes the curve information in a single number,
and can be used as an assessment of the overall performance.
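The AUC can be computed without tracing the curve, because it equals the probability that a randomly chosen positive object receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that rank-based identity; the function name is hypothetical, and the double loop is quadratic in the sample size, so it is meant only to make the definition concrete.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive-negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```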


5.1.2 Completeness and Purity 

Table 1. The definition of the classification performance metrics.

Metric           Meaning
AUC              Area under the Receiver Operating Characteristic curve
MSE              Mean squared error
c_g              Galaxy completeness
p_g              Galaxy purity
c_s              Star completeness
p_s              Star purity
p_g(c_g = x)     Galaxy purity at x galaxy completeness
p_s(c_s = x)     Star purity at x star completeness

In astronomical applications, the operating condition usually
translates to the completeness and purity requirements of the star or
galaxy sample. We define the galaxy completeness c_g (also known as
recall or sensitivity) as the fraction of the number of true galaxies
classified as galaxies out of the total number of true galaxies,

$$c_g = \frac{N_g}{N_g + M_g}, \tag{12}$$

where N_g is the number of true galaxies classified as galaxies, and
M_g is the number of true galaxies classified as stars. We define the
galaxy purity p_g (also known as precision or positive predictive value)
as the fraction of the number of true galaxies classified as galaxies out
of the total number of objects classified as galaxies,

$$p_g = \frac{N_g}{N_g + M_s}, \tag{13}$$

where M_s is the number of true stars classified as galaxies. Star
completeness and purity are defined in a similar manner.
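The two definitions reduce to simple counts over the true and predicted labels, as in the sketch below. The function name and the coding of labels (1 for galaxy, 0 for star) are assumptions made for this example.

```python
def galaxy_completeness_purity(y_true, y_pred):
    """Return (c_g, p_g) for labels coded as 1 = galaxy, 0 = star."""
    pairs = list(zip(y_true, y_pred))
    n_g = sum(1 for t, p in pairs if t == 1 and p == 1)  # galaxies kept
    m_g = sum(1 for t, p in pairs if t == 1 and p == 0)  # galaxies lost
    m_s = sum(1 for t, p in pairs if t == 0 and p == 1)  # stars leaking in
    completeness = n_g / (n_g + m_g)
    purity = n_g / (n_g + m_s)
    return completeness, purity

# Toy sample: four true galaxies and two true stars.
c_g, p_g = galaxy_completeness_purity([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```

Swapping the roles of the two labels gives the star completeness and purity.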

One of the advantages of a probabilistic classification is that the
threshold can be adjusted to produce a more complete but less pure
sample, or a less complete but more pure one. To compare the
performance of probabilistic classification techniques with that of
morphological separation, which has a fixed completeness
(c_g = 0.9964, c_s = 0.7145) at a certain purity (p_g = 0.9597,
p_s = 0.9666), we adjust the threshold of probabilistic classifiers
until the galaxy completeness c_g matches that of morphological
separation to compute the galaxy purity p_g at c_g = 0.9964. Similarly,
the star purity p_s at c_s = 0.7145 is computed by adjusting the
threshold until the star completeness of each classifier is equal to
that of morphological separation.
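One way to carry out this threshold adjustment is to rank the objects by decreasing galaxy probability and lower the cut one object at a time until the target completeness is reached. The sketch below does this; the function name is hypothetical, labels are assumed to be 1 for galaxies and 0 for stars, and objects with identical probabilities are not treated as a group, which a careful implementation would need to handle.

```python
def purity_at_completeness(y_true, p_galaxy, target):
    """Galaxy purity at the highest cut whose completeness reaches target.

    Returns (purity, threshold).
    """
    n_true_galaxies = sum(y_true)
    ranked = sorted(zip(p_galaxy, y_true), reverse=True)
    n_g = 0  # true galaxies selected so far
    for n_selected, (prob, label) in enumerate(ranked, start=1):
        n_g += label
        if n_g / n_true_galaxies >= target:
            return n_g / n_selected, prob
    raise ValueError("target completeness cannot be reached")

# Toy data (made up for illustration only).
labels = [1, 1, 1, 0, 1, 0]
probs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.2]
purity, cut = purity_at_completeness(labels, probs, target=1.0)
```

With these toy values, reaching full completeness requires lowering the cut to 0.6, which admits one star and gives a purity of 0.8.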

We can also compare the performance of different classification
techniques by assuming an arbitrary operating condition. For
example, weak lensing science measurements of the DES require
c_g > 0.960 and p_g > 0.778 to control both the statistical and
systematic errors on the cosmological parameters, and c_s > 0.250 and
p_s > 0.970 for stellar Point Spread Function (PSF) calibration
(Soumagnac et al. 2015). Although these values will likely be different
for the science cases of the CFHTLenS data, we adopt these values
to compare the classification performance at a reasonable operating
condition. Thus, we compute p_g at c_g = 0.960 and p_s at c_s = 0.250.
We also use the mean squared error (MSE) defined earlier as a
classification error metric.


5.2 Classifier Combination 

We present in Table 2 the classification performance obtained by
applying the four different combination techniques, as well as the
individual star-galaxy classification techniques, on the CFHTLenS test
data. The bold entries highlight the best technique for any particular
metric. The first four rows show the performance of four individual
star-galaxy classification techniques. Given high-quality training
data, it is not surprising that our supervised machine learning
technique TPC outperforms the unsupervised techniques. TPC is thus
shown in the first row as the benchmark.

Table 2. A summary of the classification performance metrics for the four
individual methods and the four different classification combination
methods as applied to the CFHTLenS data, with no cut applied to the
training data set. The definition of the metrics is summarized in Table 1.
The bold entries highlight the best performance values within each column.
Note that some objects in the test set have bad or missing values (e.g.,
−99 or 99) in one or more attributes, which are included here (but are
omitted from figures in which the corresponding attribute is not
available).

Classifier   AUC     MSE     p_g(c_g=0.9964)  p_s(c_s=0.7145)  p_g(c_g=0.9600)  p_s(c_s=0.2500)
TPC          0.9870  0.0208  0.9714           0.9838           0.9918           0.9977
SOMc         0.9683  0.0452  0.9125           0.8454           0.9788           0.9551
HB           0.9403  0.0705  0.9219           0.7017           0.9471           0.6963
Morphology   -       0.0397  0.9597           0.9666           -                -
WA           0.9806  0.0266  0.9755           0.9926           0.9872           0.9977
BoM          0.9870  0.0208  0.9714           0.9838           0.9918           0.9977
Stacking     0.9842  0.0194  0.9752           0.9902           0.9918           1.0000
BMC          0.9852  0.0174  0.9800           0.9959           0.9924           1.0000

Figure 2. A two-dimensional 10×10 SOM representation showing the mean
i-band magnitude (top left), the fraction of true stars in each cell (top
right), and the mean values of u − g (bottom left) and g − r (bottom
right) for the cross-validation data.

The simplest of the combination techniques, WA and BoM, generally
do not perform better than TPC. It is also interesting that, even
with binning the parameter space and selecting the best model within
each bin, BoM almost always chooses TPC as the best model in all
bins, and therefore gives the same performance as TPC in the end.
However, our BMC and stacking techniques have a similar performance
and often outperform TPC. Although TPC shows the best performance
as measured by the AUC, BMC shows the best performance in all other
metrics.
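The bin-wise model selection performed by BoM can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name is hypothetical, and using the summed squared error within each bin as the selection criterion is an assumption made for the example.

```python
from collections import defaultdict

def bucket_of_models(cells, y_true, predictions):
    """Pick, for each cell, the classifier with the lowest squared error.

    cells: cell index of each object (e.g., its SOM cell)
    predictions: dict mapping classifier name to a list of P(class 1)
    """
    squared_error = defaultdict(lambda: defaultdict(float))
    for i, cell in enumerate(cells):
        for name, probs in predictions.items():
            squared_error[cell][name] += (probs[i] - y_true[i]) ** 2
    # All classifiers see the same objects in a cell, so comparing summed
    # errors is equivalent to comparing mean squared errors.
    return {cell: min(errors, key=errors.get)
            for cell, errors in squared_error.items()}

# Toy validation set with two cells and two hypothetical classifiers.
best = bucket_of_models(
    cells=[0, 0, 1, 1],
    y_true=[1, 0, 1, 0],
    predictions={"A": [0.9, 0.1, 0.4, 0.6], "B": [0.6, 0.4, 0.9, 0.2]},
)
```

If one classifier wins in every cell, as the text describes for TPC, the returned mapping is constant and the combined prediction is identical to that classifier's output.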

In Figure 2, we show in the top left panel the mean CFHTLenS
i-band magnitude in each cell, and in the top right panel the fraction
of stars in each cell. The bottom two panels show the mean
u − g and g − r colors in each cell. These two-dimensional maps
clearly show the ability of the SOM to preserve relationships between
sources when it projects the full nine-dimensional space to the
two-dimensional map. We note that these SOM maps should only be used
to provide guidance, as the SOM mapping is a non-linear representation
of all magnitudes and colors.

We can also use the same SOM from Figure 2 to determine the
relative weights for the four individual classification methods in each
cell. We present the four weight maps for the BMC technique in
Figure 3. In these maps, a darker color indicates a higher weight, or
equivalently that the corresponding classifier performs better in that
region. These weight maps demonstrate the variation in the performance
of the individual techniques across the two-dimensional parameter
space defined by the SOM. Furthermore, since the maps in Figures 2
and 3 are constructed using the same SOM, we can determine the region
in the parameter space where each individual technique performs better
or worse. Not surprisingly, the morphological separation performs best
in the top left corner of the weight map in Figure 3, which corresponds
to the brightest CFHTLenS magnitudes i < 20 in the i-band magnitude
map of Figure 2. It is also clear that the SOM cells where the
morphological separation performs best have higher stellar fraction
than the other cells. On the other hand, TPC seems to perform best in
the region that corresponds to intermediate magnitudes 20 < i < 22.5
and 1.^ < u − g < 3.0. Our unsupervised learning method SOMc performs
relatively better at fainter magnitudes i > 21.5 with 0 < u − g < 0.5
and 0 < g − r < 0.5. Although HB shows the worst performance when
there exists a high-quality training data set, BMC still utilizes
information from HB, especially at intermediate magnitudes
20 < i < 22. Another interesting pattern is that the four techniques
seem complementary, and they are weighted most strongly in different
regions of the SOM representation.

Figure 3. A two-dimensional 10×10 SOM representation showing the relative
weights for the BMC combination technique applied to the four individual
methods for the CFHTLenS data.

In Figure 4, we compare the star and galaxy purity values for
BMC, TPC, and morphological separation as functions of i-band
magnitude. We use kernel density estimation (KDE; Silverman 1986)
with a Gaussian kernel to smooth the fluctuations in the distribution.
Although morphological separation shows a slightly better
performance in galaxy purity at bright magnitudes i < 20, BMC
outperforms both TPC and morphological separation at faint magnitudes
i > 21. As the top panel shows, the number count distribution peaks
at i ~ 22, and BMC therefore outperforms both TPC and morphological
separation for the majority of objects. It is also clear that BMC
outperforms TPC over all magnitudes. BMC can presumably accomplish
this by combining information from all base classifiers, e.g.,
giving more weight to the morphological separation method at bright
magnitudes. The bottom panel shows that the star purity of
morphological separation drops to p_s < 0.8 at fainter magnitudes
i > 21. This is expected, as our crude morphological separation
classifies every object as a galaxy beyond i > 21, and purity measures
the number of true stars classified as stars. It is again clear that
BMC outperforms both TPC and morphological separation in star purity
values over all magnitudes.

Figure 4. Purity as a function of the i-band magnitude as estimated by the
kernel density estimation (KDE) method. The top panel shows the histogram
with a bin size of 0.1 mag and the KDE for objects in the test set. The
second panel shows the fraction of stars estimated by KDE as a function of
magnitude. The bottom two panels compare the galaxy and star purity values
for BMC, TPC, and morphological separation as functions of magnitude.
Results for BMC, TPC, and morphological separation are in blue, green, and
red, respectively. The 1σ confidence bands are estimated by bootstrap
sampling.
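A purity curve smoothed with a Gaussian kernel can be sketched as a ratio of two kernel sums: the kernel-weighted count of correctly classified objects divided by the kernel-weighted count of all objects assigned to that class. The estimator below is one plausible construction and is an assumption, since the exact estimator is not spelled out in the text; the function name and the bandwidth value are also hypothetical.

```python
import math

def kde_purity(mag_grid, mags, y_true, y_pred, label=1, bandwidth=0.3):
    """Gaussian-kernel estimate of purity for class `label` versus magnitude."""
    curve = []
    for m in mag_grid:
        correct = selected = 0.0
        for mag, truth, pred in zip(mags, y_true, y_pred):
            if pred != label:
                continue  # purity only involves objects assigned to `label`
            weight = math.exp(-0.5 * ((m - mag) / bandwidth) ** 2)
            selected += weight
            if truth == label:
                correct += weight
        curve.append(correct / selected if selected > 0 else float("nan"))
    return curve

# Toy sample: a clean bright group and a half-contaminated faint group.
curve = kde_purity(
    mag_grid=[19.0, 23.0],
    mags=[19.0, 19.0, 23.0, 23.0],
    y_true=[1, 1, 1, 0],
    y_pred=[1, 1, 1, 1],
)
```

Confidence bands could then be obtained by repeating the calculation on bootstrap resamples of the objects and taking percentiles of the resulting curves.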

In Figure 5, we show the cumulative galaxy and star purity values
as functions of magnitude. Although morphological separation
performs better than TPC at bright magnitudes, its purity values
decrease as the magnitudes become fainter, and TPC eventually
outperforms morphological separation by 1-2% at i > 21. BMC clearly
outperforms both TPC and morphological separation, and it maintains
the overall galaxy purity of 0.980 up to i ~ 24.5.

We also show the star and galaxy purity values as functions of
photometric redshift estimate in Figure 6. Photo-z is estimated with
the BPZ algorithm (Benítez 2000) and provided with the CFHTLenS
photometric redshift catalogue (Hildebrandt et al. 2012). The
advantage of BMC over TPC and morphological separation is now
more pronounced in Figure 6. Although the morphological separation
method outperforms BMC at bright magnitudes in Figure 4, it is
clear that BMC outperforms both TPC and morphological separation
over all redshifts. We also present in Figure 7 how the star and galaxy
purity values vary as a function of g − r color. It is again clear that
BMC outperforms both TPC and morphological separation over all
g − r colors.

Figure 5. Cumulative purity as a function of the i-band magnitude. The
upper panel compares the galaxy purity values for BMC (blue solid line),
TPC (green dashed line), and morphological separation (red dashed line).
The lower panel compares the star purity. The 1σ error bars are computed
following the method of Paterno (2003) to avoid the unphysical errors of
binomial or Poisson statistics.
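The error bars on the cumulative purity in Figure 5 follow Paterno (2003). As we understand that approach, a fraction such as purity is treated as an efficiency k/n with a flat prior, so that its posterior is a Beta(k + 1, n − k + 1) distribution, and the error bar is the shortest interval containing 68.3% of the posterior. The sketch below evaluates this numerically on a grid; the function name and the grid resolution are assumptions, and it is not the code used to produce the figure.

```python
import math

def efficiency_interval(k, n, level=0.683, grid=2001):
    """Shortest interval holding `level` of the Beta(k + 1, n - k + 1) posterior."""
    xs = [j / (grid - 1) for j in range(grid)]

    def log_posterior(x):
        # log of x**k * (1 - x)**(n - k), treating 0 * log(0) as 0
        total = 0.0
        for power, base in ((k, x), (n - k, 1.0 - x)):
            if power > 0:
                if base <= 0.0:
                    return float("-inf")
                total += power * math.log(base)
        return total

    logs = [log_posterior(x) for x in xs]
    peak = max(logs)
    density = [math.exp(value - peak) for value in logs]
    norm = sum(density)
    # Add grid points in order of decreasing density until `level` is enclosed;
    # for a unimodal posterior the selected points form a single interval.
    order = sorted(range(grid), key=lambda j: density[j], reverse=True)
    mass, chosen = 0.0, []
    for j in order:
        chosen.append(j)
        mass += density[j] / norm
        if mass >= level:
            break
    return xs[min(chosen)], xs[max(chosen)]

# All 10 objects classified correctly: the interval stays inside [0, 1].
low, high = efficiency_interval(10, 10)
```

Unlike the binomial estimate, whose standard error vanishes when k = n, this interval remains finite and never extends beyond the physical range of a fraction.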

In Figure 8, we show the distribution of P(S), the posterior
probability that an object is a star, for BMC, TPC, and morphological
separation. It is interesting that the BMC technique assigns a
posterior star probability P(S) < 0.3 to significantly more true
galaxies than TPC, and a probability P(S) > 0.8 to significantly fewer
true galaxies. By utilizing information from different types of
classification techniques in different parts of the parameter space,
BMC becomes more certain that an object is a star or a galaxy,
resulting in improvement of overall performance.


5.3 Heterogeneous Training 

It is very costly in terms of telescope time to obtain a large sample
of spectroscopic observations down to the limiting magnitude of a
photometric sample. Thus, we investigate the impact of training set
quality by considering a more realistic case where the training data
set is available only for a small number of objects with bright
magnitudes. To emulate this scenario, we only use objects that have
spectroscopic labels from the VVDS 0226-04 field (which is located
within the CFHTLS W1 field) and impose a magnitude cut of i < 22.0 in
the training data, leaving us a training set with only 1,365 objects.
We apply the same four star-galaxy classification techniques and four
combination methods, and measure the performance of each technique on
the same test data set from Section 5.2. As the top two panels of
Figures 11, 13, and 14 show, the demographics of objects in the training
set are different from the distribution of sources in the test set.
Thus, this also serves as a test of the efficacy of heterogeneous
training.

Table 3. A summary of the classification performance metrics for the four
individual methods and the four different classification combination
methods when the training data set consists of only the sources that are in
the CFHTLS W1 field, have spectroscopic labels available from VVDS, and
have i < 22. The definition of the metrics is summarized in Table 1. The
bold entries highlight the best performance values within each column.
Note that some objects in the test set have bad or missing values (e.g.,
−99 or 99) in one or more attributes, which are included here (but are
omitted from figures in which the corresponding attribute is not
available).

Classifier   AUC     MSE     p_g(c_g=0.9964)  p_s(c_s=0.7145)  p_g(c_g=0.9600)  p_s(c_s=0.2500)
TPC          0.9399  0.0511  0.9350           0.7060           0.9570           0.9747
SOMc         0.8861  0.0989  0.8843           0.4316           0.9165           0.6263
HB           0.9386  0.0760  0.9325           0.6911           0.9424           0.6918
Morphology   -       0.0397  0.9597           0.9666           -                -
WA           0.9600  0.0536  0.9208           0.8818           0.9757           0.9815
BoM          0.9587  0.1511  0.9658           0.9862           0.9790           0.9977
Stacking     0.9442  0.1847  0.9561           0.9309           0.9664           0.9983
BMC          0.9738  0.0291  0.9696           0.9862           0.9856           1.0000

Figure 6. Similar to Figure 4 but as a function of photo-z. The bin size of
the histogram in the top panel is 0.02.

Figure 7. Similar to Figure 4 but as a function of g − r color. The bin
size of the histogram in the top panel is 0.05.

We present in Table 3 the same six metrics for each method,
and highlight the best method for each metric. Overall, the results
obtained for the reduced data set are remarkable. With a smaller
training set, our training-based methods, TPC and SOMc, suffer a
significant decrease in performance. The performance of morphological
separation and HB is essentially unchanged from Table 2, as they do not
depend on the training data. Without sufficient training data, the
advantage of combining the predictions of different classifiers is more
obvious. Even WA, the simplest of combination techniques, outperforms
all individual classification techniques in four metrics: AUC,
p_s at c_s = 0.7145, p_g at c_g = 0.9600, and p_s at c_s = 0.2500.
Although BoM always chooses TPC as the best model when we have a
high-quality training set, it now chooses various methods in different
bins and outperforms all base classifiers. While the performance of
the stacking technique is only slightly worse than that of BMC when
we have a high-quality training set, stacking now fails to outperform
morphological separation. BMC shows an impressive performance
and outperforms all other classification techniques in all six metrics.
Overall, the improvements are small but still significant since these
metrics are averaged over the full test data.

Figure 8. Histogram of the posterior probability that a source is a star
for morphological separation (top), TPC (middle), and BMC (bottom) for a
high-quality training data set. The true galaxies are in green, and true
stars are in blue. The bin size is 0.05.

Figure 9. Similar to Figure 2 but for the reduced training data set.

In Figure 10, we again show the 10×10 two-dimensional weight
map defined by the SOM. When the quality of training data is
relatively poor, the performance of training-based algorithms will
decrease, while the performance of template fitting algorithms or
morphological separation methods is independent of training data. Thus,
when the weight maps of Figure 3 and Figure 10 are visually
compared, it is clear that the BMC algorithm now uses more information
from morphological separation and HB, while it uses considerably
less information from our training-based algorithms, TPC and SOMc.
Not surprisingly, the morphological separation method performs best
at bright magnitudes, and BMC assigns more weight to HB at fainter
magnitudes.

Figure 10. Similar to Figure 3 but for the reduced training data set.

We present the star and galaxy purity values as functions of
i-band magnitude in Figure 11. The normalized density distribution as
a function of magnitude in the top panel and the stellar distribution in
the second panel clearly show that the demographics of the training
set and that of the test set are different. Since the training set is cut
at i < 22, the density distribution falls off sharply around i ~ 22
and has a higher fraction of stars than the test set. Compared to the
purity values in Figure 4, TPC now suffers a significant decrease in
star and galaxy purity. However, the purity of BMC does not show
such a significant drop and decreases by only 2-5%. As suggested by
the weight maps in Figure 10, BMC can accomplish this by shifting
the relative weights assigned to each base classifier in different SOM
cells. As the quality of the training set worsens, BMC assigns less
weight to training-based methods and more weight to HB and
morphological separation.

In Figure 12, we show the cumulative galaxy and star purity values
as functions of magnitude. Compared to Figure 5, the drop in the
performance of TPC is clear. However, even when some classifiers
have been trained on a significantly reduced training set, BMC
maintains a galaxy purity of 0.970 and a star purity of 1.0 up to
i ~ 24.5, and it still outperforms morphological separation at fainter
magnitudes i > 21.

We also show the star and galaxy purity values as functions of
photo-z in Figure 13 and as functions of g − r in Figure 14. Compared
to Figures 6 and 7, the performance of BMC becomes worse in some
photo-z and g − r bins. However, this drop in performance seems to
be confined to only a small number of objects in particular regions
of the parameter space, and BMC still outperforms both TPC and
morphological separation for the majority of objects.

Compared to Figure 8, the difference between the posterior star
probability distribution of TPC and that of BMC is now more
pronounced in Figure 15. The P(S) distribution of BMC for true
galaxies falls off sharply at P(S) ~ 0.95, and BMC does not assign a
star probability P(S) > 0.95 to any true galaxies. On the other hand,
both TPC and morphological separation classify some true galaxies
as stars with absolute certainty.


5.4 The Quality of Training Data 

The combination techniques that we have demonstrated so far use
two training-based algorithms as base classifiers. Ideally, the training
data should mirror the entire parameter space occupied by the data
to be classified. Yet we have seen in Section 5.3 that the BMC
technique does reliably extrapolate past the limits of the training data,
even when some base classifiers are trained on a low-quality training
data set. In this section, we further investigate if and where BMC
begins to break down by imposing various magnitude, photo-z, and
color cuts to change the size and composition of the training set.

Figure 11. Purity as a function of the i-band magnitude for the reduced
training data set. The top panel shows the histograms and KDEs of the
number count distribution for the training (blue) and test (green) data
sets. The second panel shows the fraction of stars in the training and test
data sets in blue and green, respectively. The bottom two panels compare
the galaxy and star purity values for BMC, TPC, and morphological
separation as functions of i-band magnitude.

In Figure 16, we present a visual comparison between different
classification techniques, when various magnitude cuts are applied on
the training data, and the performance is measured on the same test
set from Sections 5.2 and 5.3. It is not surprising that the performance
of TPC decreases as we decrease the size of the training set by
imposing more restrictive magnitude cuts, while the performance of HB
and morphological separation is essentially unchanged. However, the
effect of change in size and composition of the training set is
significantly mitigated by the use of the BMC technique. BMC
outperforms both HB and TPC in all four metrics, even when the training
set is restricted to i < 20.0. BMC also consistently outperforms
morphological separation until we impose a magnitude cut of i < 20.0 on
the training data, beyond which point BMC finally performs worse
than morphological separation. It is remarkable that BMC is able to
reliably extrapolate past the training data to i ~ 24.5, the limiting
magnitude of the test set, and outperform HB, TPC, and morphological
separation in all performance metrics, even when the demographics of
the training set do not accurately sample the data to be classified.

Figure 12. Similar to Figure 5 but for the reduced training data set.

Similarly, we impose various spectroscopic redshift cuts on the
training data in Figure 17. Since all stars have z_spec values close to
zero, we are effectively changing the demographics of the training set
by keeping all stars and gradually removing galaxies with high
redshifts. BMC begins to perform worse than morphological separation
when a conservative cut of z_spec < 0.6 is imposed. However, it is
again clear that BMC is able to utilize information from HB and
morphological separation to mitigate the drop in the performance of TPC.

In Figure 18, we decrease the size of the training set by keeping
red objects and gradually removing blue objects. A color cut seems
to have a more pronounced effect on the performance of TPC and
BMC, which perform worse than morphological separation when the
training set is restricted to g − r > 0.4. The performance depends
more strongly on the color distribution, because a significant fraction
of blue objects consists of stars, while objects with fainter
magnitudes and higher redshifts are mostly galaxies. We can verify this
in the SOM maps, where the darker (higher stellar fraction) cells in
the upper middle region of the stellar fraction map (top right panel)
have bright magnitudes i < 20 in the i-band magnitude map (top left
panel) and blue colors g − r < 0.5 in the g − r color map (bottom right
panel). On the other hand, the darker (fainter magnitude) cells in the
right-hand side of the i-band magnitude map have almost no stars in
them and are represented by bright (low stellar fraction) cells in the
stellar fraction map. Thus, these results indicate that the performance
of training-based methods depends more strongly on the composition of
the training data than on its size, and it is necessary to have a
sufficient number of the minority class in the training data set to
ensure optimal performance.
Figure 13. Similar to Figure 11 but as a function of photo-z.

6 CONCLUSIONS 

We have presented and analyzed a novel star-galaxy classification
framework for combining star-galaxy classifiers using the CFHTLenS
data. In particular, we use four independent classification techniques:
a morphological separation method; TPC, a supervised machine
learning technique based on prediction trees and a random forest;
SOMc, an unsupervised machine learning approach based on
self-organizing maps and a random atlas; and HB, a Hierarchical
Bayesian template-fitting method that we have modified and
parallelized. Both TPC and SOMc algorithms are currently available
within a software package named MLZ². Our implementation of HB and
BMC, as well as IPython notebooks that have been used to produce the
results in this paper, are available at
https://github.com/EdwardJKim/astroclass

Given the variety of star-galaxy classification methods we are using,
we fully expect the relative performance of the individual
techniques to vary across the parameter space spanned by the data. We
therefore adopt the binning strategy, where we allow different
classifier combinations in different parts of parameter space by
creating two-dimensional self-organizing maps of the full
multi-dimensional magnitude-color space. We apply different star-galaxy
classification techniques within each cell of this map, and find that
the four techniques are weighted most strongly in different regions of
the map.

² http://lcdm.astro.illinois.edu/code/mlz.html

Figure 14. Similar to Figure 11 but as a function of g − r color.

Using data from the CFHTLenS survey, we have considered different
scenarios: when an excellent training set is available with
spectroscopic labels from DEEP2, SDSS, VIPERS, and VVDS, and when
the demographics of sources in a low-quality training set do not match
the demographics of objects in the test data set. We demonstrate
that the Bayesian Model Combination (BMC) technique improves the
overall performance over any individual classification method in both
cases. We note that Carrasco Kind & Brunner (2014a) analyzed different
techniques for combining photometric redshift probability density
functions (photo-z PDFs) and also found that BMC is in general the
best photo-z PDF combination technique.

The problem of star-galaxy classification is a rich area for future
research. It is unclear if sufficient training data will be available in
future ground-based surveys. Furthermore, in large sky surveys such
as DES and LSST, photometric quality is not uniform across the sky,
and a purely morphological classifier alone will not be sufficient,
especially at faint magnitudes. Given the efficacy of our approach,
classifier combination strategies are likely the optimal approach for
currently ongoing and forthcoming photometric surveys. We therefore
plan to apply the combination technique described in this paper to
other surveys such as the DES. Our approach can also be extended
more broadly to classify objects that are neither stars nor galaxies
(e.g., quasars). Finally, future studies could explore the use of
multi-epoch data, which would be particularly useful for the next
generation of synoptic surveys.

Figure 15. Similar to Figure 8 but for the reduced training data set.
Figure 16. The classification performance metrics for BMC (blue), TPC
(green), morphology (red), and HB (purple) as applied to the CFHTLenS data
in the VVDS field with various magnitude cuts. The top panel shows the
number of sources in the training set at corresponding magnitude cuts. We
show only one of the four combination methods, BMC, which has the best
overall performance.


for the Participating Institutions of the SDSS-III Collaboration in¬ 
Figure 17. Similar to Figure 16 but using z_spec cuts.

Figure 18. Similar to Figure 16 but using g − r color cuts.


0886. The participating institutions and funding agencies are listed at 
http://vipers.inaf.it/ 

This research uses data from the VIMOS VLT Deep Survey, 
obtained from the VVDS database operated by Cesam, Laboratoire 
d’Astrophysique de Marseille, France. 